19 research outputs found

    Beehive: an FPGA-based multiprocessor architecture

    In recent years, to keep pace with Moore's law, hardware and software designers have progressively focused their efforts on exploiting instruction-level parallelism. Software simulation has been essential for studying computer architecture because of its flexibility and low cost. However, users of software simulators must choose between high performance and high-fidelity emulation. This project presents an FPGA-based multiprocessor architecture to speed up multiprocessor architecture research and ease parallel software simulation.

    HATCH: Hash Table Caching in Hardware for Efficient Relational Join on FPGA

    In this paper we present HATCH, a novel hash join engine. We follow a new design point that enables us to effectively cache the hash table entries in fast BRAM resources while supporting collision resolution in hardware. HATCH gives us the best of two worlds: (i) it uses the full capacity of the DDR memory to store complete hash tables, and (ii) by employing a cache, it exploits the high access speed of BRAMs. We demonstrate the usefulness of our approach by running hash join operations from five TPC-H benchmark queries and report speedups of up to 2.8x over a pipeline-optimized baseline. The research leading to these results has received funding from the European Union's Seventh Framework Programme (FP7/2007-2013) for the Advanced Analytics for Extremely Large European Databases (AXLE) project under grant agreement number 318633, and from the Ministry of Economy and Competitiveness of Spain under contract number TIN2012-34557. Postprint (author's final draft).
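    As a rough illustration of the design point described above, the following C++ sketch models the idea in software: a small direct-mapped cache (standing in for the fast BRAM resources) fronts a complete hash table (standing in for DDR memory), and probes fall back to the full table only on a cache miss. All names, sizes, and types here are illustrative assumptions, not taken from the HATCH hardware.

    ```cpp
    // Software model of the caching idea: a small, fast cache (standing in for
    // BRAM) fronts a large hash table (standing in for DDR). Sizes, names, and
    // types are illustrative only.
    #include <array>
    #include <cstdint>
    #include <iostream>
    #include <optional>
    #include <unordered_map>

    struct Entry { std::uint32_t key = 0; std::uint32_t payload = 0; bool valid = false; };

    constexpr std::size_t CACHE_SLOTS = 1024;                    // "BRAM" capacity (assumed)
    std::array<Entry, CACHE_SLOTS> cache;                        // direct-mapped cache of entries
    std::unordered_map<std::uint32_t, std::uint32_t> ddr_table;  // complete hash table in "DDR"

    // Probe the fast cache first; on a miss, fall back to the full table and
    // install the entry in the cache for subsequent probes of the same key.
    std::optional<std::uint32_t> probe(std::uint32_t key) {
        Entry& slot = cache[key % CACHE_SLOTS];
        if (slot.valid && slot.key == key) return slot.payload;  // cache hit
        auto it = ddr_table.find(key);                           // slow path to "DDR"
        if (it == ddr_table.end()) return std::nullopt;          // no join match
        slot = Entry{key, it->second, true};                     // fill the cache slot
        return it->second;
    }

    int main() {
        // Build phase: hash the (smaller) build relation into the table.
        for (std::uint32_t k = 0; k < 100000; ++k) ddr_table[k] = k * 10;
        // Probe phase: repeated keys hit the cache after the first access.
        for (std::uint32_t k : {42u, 42u, 123456u})
            if (auto v = probe(k)) std::cout << k << " joins " << *v << "\n";
    }
    ```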

    AxleDB: A novel programmable query processing platform on FPGA

    With the rise of Big Data, providing high-performance query processing capabilities through the acceleration of database analytics has gained significant attention. Leveraging Field-Programmable Gate Array (FPGA) technology, this approach can lead to clear benefits. In this work, we present the design and implementation of AxleDB, an FPGA-based platform that enables fast query processing for database systems by melding novel database-specific accelerators with commercial-off-the-shelf (COTS) storage over modern interfaces, in a novel, unified, and programmable environment. AxleDB can perform a large subset of SQL queries through its set of instructions, which map compute-intensive database operations, such as filter, arithmetic, aggregate, group-by, table join, and sort, onto the specialized high-throughput accelerators. To minimize the number of SSD I/O operations required, AxleDB also supports hardware MinMax indexing for databases. We evaluated AxleDB with five decision-support queries from the TPC-H benchmark suite and achieved speedups from 1.8X to 34.2X and energy-efficiency gains from 2.8X to 62.1X in comparison to the state-of-the-art DBMSs PostgreSQL and MonetDB. The research leading to these results has received funding from the European Union Seventh Framework Programme (FP7) (under the AXLE project, GA number 318633), the Ministry of Economy and Competitiveness of Spain (under contract number TIN2015-65316-p), the Turkish Ministry of Development TAM Project (number 2007K120610), and Bogazici University Scientific Projects (number 7060). Peer reviewed. Postprint (author's final draft).
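    The MinMax indexing mentioned above can be pictured as block-range pruning: keep the per-block minimum and maximum of a column and skip any SSD block whose range cannot satisfy the filter predicate. The C++ sketch below is a minimal software model of that idea; the struct layout, function names, and the BETWEEN-style predicate are assumptions for illustration, not AxleDB's actual on-disk format or instruction set.

    ```cpp
    // Sketch of MinMax (block-range) pruning: per data block, keep the minimum
    // and maximum of a column and skip any block whose range cannot satisfy the
    // predicate. Layout and names are assumptions for illustration.
    #include <cstdint>
    #include <iostream>
    #include <vector>

    struct BlockStats {
        std::uint64_t block_id;   // which storage block these statistics describe
        std::int64_t  min_val;    // smallest column value stored in the block
        std::int64_t  max_val;    // largest column value stored in the block
    };

    // Return the blocks that may contain rows with a value in [lo, hi].
    std::vector<std::uint64_t> blocks_to_read(const std::vector<BlockStats>& index,
                                              std::int64_t lo, std::int64_t hi) {
        std::vector<std::uint64_t> out;
        for (const auto& b : index)
            if (b.max_val >= lo && b.min_val <= hi)  // ranges overlap: must read it
                out.push_back(b.block_id);
        return out;                                  // every other block is skipped
    }

    int main() {
        std::vector<BlockStats> idx = {{0, 1, 50}, {1, 60, 120}, {2, 90, 300}};
        for (auto id : blocks_to_read(idx, 100, 200))   // WHERE col BETWEEN 100 AND 200
            std::cout << "read block " << id << "\n";   // prints blocks 1 and 2
    }
    ```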

    Accelerating Hash-Based Query Processing Operations on FPGAs by a Hash Table Caching Technique

    Extracting valuable information from the rapidly growing field of Big Data faces serious performance constraints, especially in software-based database management systems (DBMS). In a query processing system, hash-based computational primitives such as the hash join and the group-by are the most time-consuming operations, as they frequently need to access the hash table in high-latency off-chip memory and to traverse the whole table. Moreover, hash collisions are an inherent issue of hash tables that can adversely degrade overall performance. To alleviate this problem, in this paper we present a novel, purely hardware-based hash engine implemented on an FPGA. To mitigate the high memory access latencies and to resolve hash collisions faster, we follow a novel design point based on caching the hash table entries in the fast on-chip Block RAMs of the FPGA. Faster access to the corresponding hash table entries from the cache leads to improved overall performance. We evaluate the proposed approach by running the hash-based table join and group-by operations of five TPC-H benchmark queries. The results show 2.9×–4.4× speedups over the cache-less FPGA-based baseline. The research leading to these results has received funding from the European Union's Seventh Framework Programme (FP7/2007-2013) for the Advanced Analytics for Extremely Large European Databases (AXLE) project under grant agreement number 318633, and from the Ministry of Economy and Competitiveness of Spain under contract number TIN2015-65316-p. Peer reviewed. Postprint (author's final draft).
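    For readers less familiar with the second operation being accelerated, the snippet below is a minimal software model of a hash-based group-by: every input row hashes on its group key and updates a running aggregate in the hash table, which is exactly the per-row hash-table access pattern the engine caches in BRAM. The row layout and the SUM aggregate are illustrative assumptions, not taken from the paper.

    ```cpp
    // Software model of a hash-based group-by: each input row hashes on its
    // group key and updates a running aggregate in the hash table. The row
    // layout and the SUM aggregate are illustrative assumptions.
    #include <cstdint>
    #include <iostream>
    #include <unordered_map>
    #include <vector>

    struct Row { std::uint32_t group_key; std::int64_t value; };

    int main() {
        std::vector<Row> rows = {{1, 10}, {2, 5}, {1, 7}, {3, 1}, {2, 2}};
        std::unordered_map<std::uint32_t, std::int64_t> sums;  // group key -> SUM(value)
        for (const auto& r : rows)
            sums[r.group_key] += r.value;                      // one hash-table access per row
        for (const auto& [key, sum] : sums)                    // emit one result row per group
            std::cout << "group " << key << ": " << sum << "\n";
    }
    ```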

    An empirical evaluation of High-Level Synthesis languages and tools for database acceleration

    High-Level Synthesis (HLS) languages and tools are emerging as the most promising technique for making FPGAs more accessible to software developers. Nevertheless, picking the most suitable HLS tool for a certain class of algorithms depends on requirements such as area and throughput, as well as on programmer experience. In this paper, we explore the different trade-offs present when using a representative set of HLS tools in the context of Database Management System (DBMS) acceleration. More specifically, we conduct an empirical analysis of four representative frameworks (Bluespec SystemVerilog, Altera OpenCL, LegUp, and Chisel) that we use to accelerate commonly used database algorithms such as sorting, the median operator, and hash joins. Through our implementation experience and empirical results for database acceleration, we conclude that the selection of the most suitable HLS tool depends on a set of orthogonal characteristics, which we highlight for each HLS framework. Peer reviewed. Postprint (author's final draft).

    From plasma to beefarm: Design experience of an FPGA-based multicore prototype

    In this paper, we take a MIPS-based open-source uniprocessor soft core, Plasma, and extend it to obtain the Beefarm infrastructure for FPGA-based multiprocessor emulation, a popular research topic of recent years in both the FPGA and computer architecture communities. We discuss various design trade-offs and demonstrate superior scalability through experimental results compared to traditional software instruction set simulators. Based on our experience of designing and building a complete FPGA-based multiprocessor emulation system that supports run-time and compiler infrastructure, and on the actual executions of our experiments running Software Transactional Memory (STM) benchmarks, we comment on the pros, cons, and future trends of using hardware-based emulation for research. Peer reviewed. Postprint (author's final draft).

    Multicore architecture prototyping on reconfigurable devices

    In the last decades, several performance walls have been hit. The memory wall and the power wall are limiting the performance scaling of digital microprocessors. Homogeneous multicores rely on thread-level parallelism, which is challenging to exploit. New heterogeneous architectures promise higher performance per watt, but software simulators have limited capacity for researching them. In this thesis we investigate the advantages of Field-Programmable Gate Array (FPGA) devices for multicore research. We developed three prototypes, implementing up to 24 cores in a single FPGA, and show their superior performance and precision compared to software simulators. Moreover, our prototypes perform full-system emulation and are fully modifiable. We use our prototypes to implement novel architectural extensions such as Transactional Memory (TM). This use case allowed us to study the different needs that computer architects may have and how to address them on FPGAs. We developed several techniques to offer profiling, debugging, and verification in each stage of the design process. These solutions may bridge the gap between FPGA-based hardware design and computer architects. In particular, we place special emphasis on non-obtrusive techniques, so that the precision of the emulation is not affected. Based on current trends and the sustained growth of the high-level synthesis community, we expect FPGAs to become an integral part of computer architecture design in the coming years. Postprint (published version).
